-
Notifications
You must be signed in to change notification settings - Fork 233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Local Execution #1305
Local Execution #1305
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing this currently does not build. Now that's it's pulled out into a separate library it would be nice to prioritize that as well as add the build into CI to help push the code towards being mergeable.
Also, it would be nice to start adding some tests not aiming at full coverage, but just as documentation of how you're envisioning some of the APIs here being used (essentially as machine-checkable documentation)
Reviewed 137 of 137 files at r1, all commit messages.
Reviewable status: all files reviewed, 67 unresolved discussions (waiting on @reyna-abhyankar)
lib/CMakeLists.txt
line 3 at r1 (raw file):
add_subdirectory(pcg) add_subdirectory(compiler) add_subdirectory(execution)
Might be better renamed "local-execution"--"execution" by itself is a bit vague
Code quote:
execution
lib/README.md
line 8 at r1 (raw file):
- `kernels`: - `op-attrs`: - `execution`:
Might be worth adding a quick summary of what the library does, especially as this one is a bit less trivial than some of the others
Code quote:
- `execution`:
lib/execution/CMakeLists.txt
line 13 at r1 (raw file):
op-attrs utils optional
Use std::optional
instead and remove the optional
dependency
lib/execution/CMakeLists.txt
line 14 at r1 (raw file):
utils optional compiler
Why does this need compiler?
lib/execution/include/local_execution/local_allocator.h
line 11 at r1 (raw file):
struct LocalAllocator : public IAllocator { LocalAllocator(size_t);
Worth including a parameter name here since it's not clear what this actually does.
Suggestion:
LocalAllocator(size_t ?????);
lib/execution/include/local_execution/local_allocator.h
line 14 at r1 (raw file):
~LocalAllocator() override; void *allocate(Tensor) override;
The code to calculate the memory size needed for a tensor is shared between allocators, so it seems worth making this the interface rather than forcing each allocator to compute the size itself
Suggestion:
void *allocate(size_t) override;
lib/execution/include/local_execution/local_allocator.h
line 16 at r1 (raw file):
void *allocate(Tensor) override; void deallocate(void *) override; size_t get_ptr_memory_size(void *);
Why?
lib/execution/include/local_execution/local_allocator.h
line 21 at r1 (raw file):
size_t total_memory_size; size_t allocated_memory_size; std::unordered_map<void *, size_t> ptr_memory_size_mapping;
Why? Isn't this already handled by cudaMalloc
?
lib/execution/include/local_execution/local_allocator.h
line 22 at r1 (raw file):
size_t allocated_memory_size; std::unordered_map<void *, size_t> ptr_memory_size_mapping; };
Add CHECK_RC_COPY_VIRTUAL_COMPLIANT(LocalAllocator);
. This checks the conditions in https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-copy-virtual
lib/execution/include/local_execution/local_model_training_instance.h
line 14 at r1 (raw file):
namespace FlexFlow { struct LocalModelTrainingInstance {
Maybe worth splitting out a TrainingInstance
to remove duplication between LocalModelTrainingInstance
and ModelTrainingInstance
(or whatever it is called now if you changed its name)
lib/execution/include/local_execution/local_model_training_instance.h
line 34 at r1 (raw file):
local_training_backing); void initialize_backing(LocalModelTrainingInstance &);
See https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/
Suggestion:
LocalModelTrainingInstance initialize_backing(ComputationGraph const &, Optimizer const &, etc.);
lib/execution/include/local_execution/local_task_argument_accessor.h
line 15 at r1 (raw file):
template <typename T> T const &get_argument(slot_id) const;
Same for all methods here I think
Suggestion:
template <typename T>
T const &get_argument(slot_id) const override;
lib/execution/include/local_execution/local_task_argument_accessor.h
line 37 at r1 (raw file):
private: Allocator allocator; std::unordered_map<std::pair<slot_id, IsGrad>, GenericTensorAccessorW> tensor_backing_map;
Pull out into a separate struct that has a name. Raw pair
s, variant
s, etc. tend to be confusing as it's not clear what they do.
Code quote:
std::pair<slot_id, IsGrad>
lib/execution/include/local_execution/local_task_argument_accessor.h
line 44 at r1 (raw file):
}
CHECK_RC_COPY_VIRTUAL_COMPLIANT
lib/execution/include/local_execution/local_training_backing.h
line 28 at r1 (raw file):
struct LocalTrainingBacking { LocalTrainingBacking(ComputationGraph, Allocator, std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW>);
Suggestion:
LocalTrainingBacking(ComputationGraph const &, Allocator const &, std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> const &);
lib/execution/include/local_execution/local_training_backing.h
line 35 at r1 (raw file):
void execute_update(); LocalTaskArgumentAccessor get_fwd_accessor(OpTaskInvocation);
Suggestion:
TaskArgumentAccessor get_fwd_accessor(OpTaskInvocation);
lib/execution/include/local_execution/local_training_backing.h
line 36 at r1 (raw file):
LocalTaskArgumentAccessor get_fwd_accessor(OpTaskInvocation); LocalTaskArgumentAccessor get_bwd_accessor(OpTaskInvocation);
Suggestion:
TaskArgumentAccessor get_bwd_accessor(OpTaskInvocation);
lib/execution/include/local_execution/local_training_backing.h
line 40 at r1 (raw file):
private: Allocator allocator; std::vector<Node> topologically_ordered_graph;
Just use the ComputationGraph
you already have until we actually learn the performance is that important.
Code quote:
std::vector<Node> topologically_ordered_graph;
lib/execution/include/local_execution/local_training_backing.h
line 43 at r1 (raw file):
// memory std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> op_slot_tensor_mapping;
What's happening with args (i.e., not tensors)?
lib/execution/include/local_execution/local_training_backing.h
line 50 at r1 (raw file):
template <typename DeviceState> std::unordered_map<task_id_t, TaskImplFunction<DeviceState>> task_id_impl_mapping; std::unordered_map<task_id_t, OpTaskSignature> task_id_signature_mapping;
Probably worth pulling out into a separate struct that's responsible for the internals of task registration to avoid LocalTrainingBacking
from becoming monolithic and unwieldy
Code quote:
// hold mappings
std::unordered_map<task_id_t, void*> task_id_impl_mapping;
// TODO: add init task mapping
template <typename DeviceState>
std::unordered_map<task_id_t, TaskImplFunction<DeviceState>> task_id_impl_mapping;
std::unordered_map<task_id_t, OpTaskSignature> task_id_signature_mapping;
lib/execution/include/local_execution/task_argument_accessor.h
line 12 at r1 (raw file):
namespace FlexFlow { using PrivilegeType = variant<privilege_mode_to_accessor<Permissions::RW>,
We have std::variant
now that we're on C++17
Suggestion:
std::variant
lib/execution/src/task_spec/LocalTaskBackend.md
line 27 at r1 (raw file):
Retriever
Is this still up to date? I don't recall seeing this class. We should make sure all the documentation is up-to-date before merging--feel free to draft @abisatish to help out with any doc writing and updating (afew docstrings would be good too as the design solidifies)
lib/execution/src/task_spec/OpTaskSpec.md
line 1 at r1 (raw file):
# OpTaskSpec
Worth coordinating with @abisatish to try to make the documentation around operators, task_spec, etc. a bit more cohesive and approachable
lib/execution/src/task_spec/op_task_signature.cc
line 8 at r1 (raw file):
void OpTaskSignature::add_input_slot(slot_id name, SlotType slot_type = SlotType::TENSOR) { OpTensorSlotSpec op_tensor_slot_spec {
I like to do this to signify that this is just a composite of these values, rather than some object that might be doing nontrivial things in a constructor.
Suggestion:
OpTensorSlotSpec op_tensor_slot_spec = {
lib/execution/src/task_spec/TaskSpecDSL.md
line 0 at r1 (raw file):
Is this still up-to-date?
lib/execution/src/task_spec/TaskSpecDSL.md
line 3 at r1 (raw file):
# Task Spec DSL There are three components in task-based systems: `agent`, `task`, and `backend`. At a high level, they interact with each other in the following way: an `agent` defines a `task` which is executed by the `backend`. In FlexFlow, an `agent` could be an operator (e.g., `Conv2D`), a `task` could be its `forward()` function, and this could be executed on a `backend` like the Legion runtime. A more complicated example is a deep neural network, in which there are many `agent`s dispatching many `task`s that may depend on each other. In this document, we will define the three terms and their interactions.
Why add the concept of an agent
? I'm not against describing it here in a more object-oriented vs functional way, but I'm not sure why it's necessary? Why not just treat tasks as functions?
lib/execution/src/task_spec/TaskSpecInternals.md
line 0 at r1 (raw file):
Would be nice to combine some of these docs and make them a bit more cohesive. Also, it would be good to explain what capabilities the task spec interface provides to code implementing tasks (argument accessor, bindings, etc.) independent of how they're implemented in the local backing. Also would be nice to have some documentation explaining why we have the backings and how they related to each other (i.e., they are not equivalent, but must be able to handle all operators in a semantics-preserving if not performance-preserving way)
lib/execution/src/task_spec/local_backing/local_allocator.cc
line 24 at r1 (raw file):
size_t freed_memory_size = this->ptr_memory_size_mapping[ptr]; checkCUDA(cudaFree(ptr)); this->allocated_memory_size -= freed_memory_size;
Why track allocated memory in LocalAllocator
? Might be worth instead having a separate IAllocator
implementation that wraps an IAllocator
with this functionality instead
Code quote:
this->allocated_memory_size
lib/execution/src/task_spec/local_backing/local_allocator.cc
line 37 at r1 (raw file):
} LocalAllocator::~LocalAllocator() {
Why was this chosen over not owning all of the memory regions (i.e., whoever uses them is responsible for deleting them)? No strong opinion, just curious
lib/execution/src/task_spec/local_backing/local_model_training_instance.cc
line 16 at r1 (raw file):
} GenericTensorAccessorR forward(LocalModelTrainingInstance const & local_model_training_instance) {
Why return this over having the LocalModelTrainingInstance
just write out to an externally-provided tensor that's encoded when the instance is created?
Code quote:
GenericTensorAccessorR
lib/execution/src/task_spec/local_backing/local_task_argument_accessor.cc
line 9 at r1 (raw file):
template <typename T> T const & LocalTaskArgumentAccessor::get_argument(slot_id) const { not_implemented()
Semicolons?
lib/execution/src/task_spec/local_backing/local_task_argument_accessor.cc
line 38 at r1 (raw file):
template <Permissions PRIV> privilege_mode_to_accessor<PRIV> LocalTaskArgumentAccessor::get_tensor_grad(slot_id slot) const { return this->get_tensor(slot_id, is_grad = true);
What is this syntax?
Code quote:
is_grad = true
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 12 at r1 (raw file):
// operator/slot --> GTAR LocalTrainingBacking::LocalTrainingBacking(ComputationGraph computation_graph,
This function is quite long and seems to do a couple different things--might be worth splitting up into multiple functions/objects that are composed together for readability
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 14 at r1 (raw file):
LocalTrainingBacking::LocalTrainingBacking(ComputationGraph computation_graph, Allocator allocator, std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> allocated_tensors) {
Suggestion:
LocalTrainingBacking::LocalTrainingBacking(ComputationGraph const &computation_graph,
Allocator const &allocator,
std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> const &preallocated_tensors) {
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 15 at r1 (raw file):
Allocator allocator, std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> allocated_tensors) { this->op_slot_tensor_mapping.insert(allocated_tensors.begin(), allocated_tensors.end());
Just add to initializer list?
Suggestion:
: op_slot_tensor_mapping(allocated_tensors)
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 17 at r1 (raw file):
this->op_slot_tensor_mapping.insert(allocated_tensors.begin(), allocated_tensors.end()); std::vector<Node> layer_nodes = get_topological_ordering(computation_graph);
Add overload so it returns a std::vector<layer_guid_t>
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 18 at r1 (raw file):
std::vector<Node> layer_nodes = get_topological_ordering(computation_graph); for (Node node: layer_nodes) {
Suggestion:
Node const &node
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 19 at r1 (raw file):
std::vector<Node> layer_nodes = get_topological_ordering(computation_graph); for (Node node: layer_nodes) { Layer layer = computation_graph.value().at(node); // operator_guid_t doesn't seem as expressive as layer -- how to get attrs?
Add overload for layer_guid_t
Code quote:
computation_graph.value().at(node);
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 21 at r1 (raw file):
Layer layer = computation_graph.value().at(node); // operator_guid_t doesn't seem as expressive as layer -- how to get attrs? std::vector<task_id_t> task_ids = get_task_ids(layer.attrs); // still think we need this, since we can't assume all ops have an init task for (task_id_t task_id: task_ids) {
Gentle reminder that format.sh
before requesting review is appreciated 🙂
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 23 at r1 (raw file):
for (task_id_t task_id: task_ids) { this->task_id_signature_mapping.insert({task_id, get_signature<task_id>()}); this->task_id_impl_mapping.insert({task_id, get_task_impl<task_id>()});
Should it ever be possible for these to diverge? Why two maps and not one?
Code quote:
this->task_id_signature_mapping.insert({task_id, get_signature<task_id>()});
this->task_id_impl_mapping.insert({task_id, get_task_impl<task_id>()});
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 30 at r1 (raw file):
// TODO: this ^^ should definitely be a test std::unordered_set<MultiDiEdge> outgoing_edges = get_outgoing_edges(computation_graph, node); for (MultiDiEdge edge: outgoing_edges) {
Suggestion:
for (MultiDiEdge const &edge
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 47 at r1 (raw file):
Tensor tensor = computation_graph.value().at(edge); void * ptr = this->allocator.allocate(tensor); GenericTensorAccessorW tensor_backing = {tensor.data_type, tensor.get_shape(), ptr};
Pull this out into a separate function? Allocating a tensor backing seems like a common operation
Code quote:
void * ptr = this->allocator.allocate(tensor);
GenericTensorAccessorW tensor_backing = {tensor.data_type, tensor.get_shape(), ptr};
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 55 at r1 (raw file):
// TODO: register update task this->allocator = allocator;
move toinitialization list?
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 64 at r1 (raw file):
GenericTensorAccessorR LocalTrainingBacking::execute_forward() { for (auto operator_node: this->topologically_ordered_graph) {
Suggestion:
for (layer_guid_t operator_node: get_topological_ordering(this->computation_graph)) {
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 66 at r1 (raw file):
for (auto operator_node: this->topologically_ordered_graph) { auto attrs = computation_graph.value().at(operator_node).attrs; OpTaskInvocation invocation = forward(operator_node.attrs);
Is there currently any checking anywhere that the OpTaskInvocation
actually complies with the OpTaskSignature
?
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 74 at r1 (raw file):
void LocalTrainingBacking::execute_backward() { for (auto operator_node: std::reverse(this->topologically_ordered_graph.begin(), this->topologically_ordered_graph.end())) {
Check containers.h
Code quote:
std::reverse(this->topologically_ordered_graph.begin(), this->topologically_ordered_graph.end())
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 74 at r1 (raw file):
void LocalTrainingBacking::execute_backward() { for (auto operator_node: std::reverse(this->topologically_ordered_graph.begin(), this->topologically_ordered_graph.end())) {
Suggestion:
layer_guid_t
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 78 at r1 (raw file):
OpTaskInvocation invocation = backward(attrs); LocalTaskArgumentAccessor accessor = this->get_local_task_arg_accessor(invocation); task_id_impl_mapping[task_id](accessor);
Probably worth pulling out into a separate method for readability
Suggestion:
this->call_task_impl(task_id, accessor);
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 87 at r1 (raw file):
LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) {
Suggestion:
get_task_argument_accessor
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 87 at r1 (raw file):
LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) {
Suggestion:
TaskArgumentAccessor
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 88 at r1 (raw file):
LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) { LocalTaskArgumentAccessor local_task_arg_acc (this->allocator);
Maybe make LocalTaskArgumentAccessor
immutable rather than being incrementally constructed with insert_tensor
? Not critical, just worth considering
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 94 at r1 (raw file):
std::pair<slot_id, IsGrad> tensor_id = tensor_binding->first; operator_guid_t op_guid = invocation.get_operator_guid_t(); GenericTensorAccessorW tensor_backing = this->op_slot_tensor_mapping[{op_guid, tensor_id->first}];
Suggestion:
GenericTensorAccessorW tensor_backing = this->op_slot_tensor_mapping.at({op_guid, tensor_id->first});
lib/kernels/include/kernels/allocation.h
line 11 at r1 (raw file):
struct IAllocator { virtual void *allocate(Tensor) = 0;
Calculating the memory size of a tensor isn't an allocator-specific operation, so let's keep the IAllocator
API like this. If you wanted this for convenience, I have no issue with putting it on Allocator
. Just keep in mind that everything that IAllocator
defines must be implemented by each implementation of IAllocator
Suggestion:
virtual void *allocate(size_t) = 0;
lib/kernels/include/kernels/allocation.h
line 20 at r1 (raw file):
Allocator() = delete; void *allocate(Tensor);
Suggestion:
void *allocate(TensorShape const &);
lib/op-attrs/include/op-attrs/get_task_ids.h
line 0 at r1 (raw file):
Should be moved to local-execution
--task ids are not a concept that exists at the level of op-attrs
lib/op-attrs/include/op-attrs/get_task_ids.h
line 50 at r1 (raw file):
template <typename... Ts> std::vector<task_id_t> get_task_ids(variant<Ts...> const &attrs) { return visit(GetTaskIdsFunctor{}, attrs);
Just use an auto lambda
lib/op-attrs/src/datatype.cc
line 5 at r1 (raw file):
namespace FlexFlow { size_t size_of(DataType data_type) {
Let's not make this one underscore off from a really annoying to debug typo 😂. Currently if you forget an underscore you get the wrong result and there's no type error to stop you
Suggestion:
size_t size_of_datatype(DataType data_type) {
lib/op-attrs/src/datatype.cc
line 20 at r1 (raw file):
return sizeof(double); default: throw mk_runtime_error("Unknown data type");
I think mk_runtime_error
supports fmt
syntax (if I'm wrong ignore me)
Suggestion:
throw mk_runtime_error("Unknown data type {}", dt);
lib/runtime/src/task_invocation_compilation.cc
line 0 at r1 (raw file):
Is there a reason we're deleting all this? I'm not advocating moving it to local-execution, but keeping it around for now for Tanmaey's sanity might be nice (same for a couple other files deleted in this PR that would actually be quite useful for implementing the legion backend)
lib/runtime/src/tasks.h
line 187 at r1 (raw file):
template <task_id_t> void* get_task_impl();
Why not use that nice variant
of std::function
s you defined somewhere else? void *
seems maybe a little drastic, as it doesn't even ensure the thing is a function, and simultaneously provides no hints to anyone reading the code what this function does?
lib/execution/include/task_spec/typed_future.h
line 32 at r1 (raw file):
} ArgTypeRuntimeTag get_type_tag() const {
Why?
lib/execution/include/task_spec/typed_future.h
line 58 at r1 (raw file):
std::type_index type_idx; Legion::Future future; ArgTypeRuntimeTag type_tag;
Should replace std::type_index
, not accompany it.
lib/execution/include/task_spec/typed_future_map.h
line 55 at r1 (raw file):
std::type_index type; Legion::FutureMap future_map;
This is going to have issues since local-execution
should not depend on legion.
lib/execution/include/task_spec/typed_future_map.h
line 56 at r1 (raw file):
std::type_index type; Legion::FutureMap future_map; ArgTypeRuntimeTag type_tag;
Should replace std::type_index
.
Also I'm not sure if this is necessary, since ArgTypeRuntimeTag
currently (i.e., in the main repo-refactor
branch) primarily exists to enable legion serialization, which is not necessary for Futures (or for local execution at all)
lib/execution/include/task_spec/typed_task_invocation.h
line 120 at r1 (raw file):
std::type_index type_idx; std::shared_ptr<TaskInvocation const> invocation; ArgTypeRuntimeTag type_tag;
Should replace std::type_index
lib/execution/src/task_spec/op_task_invocation.h
line 66 at r1 (raw file):
return spec.get_type_tag().get_type_idx(); } };
auto lambda?
Code quote:
struct OpArgSpecTypeAccessor {
std::type_index operator()(ConcreteArgSpec& spec) {
return spec.get_type_tag().get_type_idx();
}
std::type_index operator()(IndexArgSpec& spec) {
return spec.get_type_tag().get_type_idx();
}
std::type_index operator()(OpArgRefSpec& spec) {
return spec.get_type_tag().get_type_idx();
}
std::type_index operator()(CheckedTypedFuture& spec) {
return spec.get_type_tag().get_type_idx();
}
std::type_index operator()(CheckedTypedFutureMap& spec) {
return spec.get_type_tag().get_type_idx();
}
std::type_index operator()(RuntimeArgRefSpec& spec) {
return spec.get_type_tag().get_type_idx();
}
std::type_index operator()(TaskInvocationSpec& spec) {
return spec.get_type_tag().get_type_idx();
}
};
lib/execution/src/task_spec/op_task_signature.h
line 91 at r1 (raw file):
private: OpTaskType type; std::vector<std::type_index> return_value;
Do we need multiple return values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 112 of 144 files reviewed, 67 unresolved discussions (waiting on @lockshaw)
lib/README.md
line 8 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be worth adding a quick summary of what the library does, especially as this one is a bit less trivial than some of the others
Done.
lib/execution/CMakeLists.txt
line 13 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use
std::optional
instead and remove theoptional
dependency
Done.
lib/execution/CMakeLists.txt
line 14 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why does this need compiler?
Done.
lib/execution/include/local_execution/local_allocator.h
line 22 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add
CHECK_RC_COPY_VIRTUAL_COMPLIANT(LocalAllocator);
. This checks the conditions in https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-copy-virtual
Done.
lib/execution/include/local_execution/local_task_argument_accessor.h
line 37 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Pull out into a separate struct that has a name. Raw
pair
s,variant
s, etc. tend to be confusing as it's not clear what they do.
Why not just using SlotGradId = std::pair<slot_id, IsGrad>
?
lib/execution/include/local_execution/local_task_argument_accessor.h
line 44 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
CHECK_RC_COPY_VIRTUAL_COMPLIANT
Done.
lib/execution/include/local_execution/local_training_backing.h
line 28 at r1 (raw file):
struct LocalTrainingBacking { LocalTrainingBacking(ComputationGraph, Allocator, std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW>);
Done.
lib/execution/include/local_execution/local_training_backing.h
line 35 at r1 (raw file):
void execute_update(); LocalTaskArgumentAccessor get_fwd_accessor(OpTaskInvocation);
Done.
lib/execution/include/local_execution/local_training_backing.h
line 36 at r1 (raw file):
LocalTaskArgumentAccessor get_fwd_accessor(OpTaskInvocation); LocalTaskArgumentAccessor get_bwd_accessor(OpTaskInvocation);
Done.
lib/execution/include/local_execution/local_training_backing.h
line 43 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What's happening with args (i.e., not tensors)?
Yeah, have to add that based on our discussion yesterday.
lib/execution/include/local_execution/task_argument_accessor.h
line 12 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
We have
std::variant
now that we're on C++17
Done.
lib/execution/src/task_spec/op_task_signature.cc
line 8 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I like to do this to signify that this is just a composite of these values, rather than some object that might be doing nontrivial things in a constructor.
Done.
lib/execution/src/task_spec/local_backing/local_task_argument_accessor.cc
line 9 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Semicolons?
Done.
lib/execution/src/task_spec/local_backing/local_task_argument_accessor.cc
line 38 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is this syntax?
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 14 at r1 (raw file):
LocalTrainingBacking::LocalTrainingBacking(ComputationGraph computation_graph, Allocator allocator, std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> allocated_tensors) {
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 15 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just add to initializer list?
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 18 at r1 (raw file):
std::vector<Node> layer_nodes = get_topological_ordering(computation_graph); for (Node node: layer_nodes) {
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 21 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Gentle reminder that
format.sh
before requesting review is appreciated 🙂
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 23 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should it ever be possible for these to diverge? Why two maps and not one?
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 30 at r1 (raw file):
// TODO: this ^^ should definitely be a test std::unordered_set<MultiDiEdge> outgoing_edges = get_outgoing_edges(computation_graph, node); for (MultiDiEdge edge: outgoing_edges) {
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 87 at r1 (raw file):
LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) {
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 87 at r1 (raw file):
LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) {
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 94 at r1 (raw file):
std::pair<slot_id, IsGrad> tensor_id = tensor_binding->first; operator_guid_t op_guid = invocation.get_operator_guid_t(); GenericTensorAccessorW tensor_backing = this->op_slot_tensor_mapping[{op_guid, tensor_id->first}];
Done.
lib/op-attrs/src/datatype.cc
line 5 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Let's not make this one underscore off from a really annoying to debug typo 😂. Currently if you forget an underscore you get the wrong result and there's no type error to stop you
Done.
lib/op-attrs/src/datatype.cc
line 20 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I think
mk_runtime_error
supportsfmt
syntax (if I'm wrong ignore me)
Done.
lib/runtime/src/task_invocation_compilation.cc
line at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is there a reason we're deleting all this? I'm not advocating moving it to local-execution, but keeping it around for now for Tanmaey's sanity might be nice (same for a couple other files deleted in this PR that would actually be quite useful for implementing the legion backend)
So there were actually 2 files with the same name in different subfolders. This deleted one looks like a precursor to the newer file which is in lib/execution/src/task_spec/task_invocation_compilation.cc
and has a bunch of Legion stuff. Let me know if that's correct (I haven't verified which one is actually older via commit history, so I'll do that if you can't immediately tell).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 112 of 144 files reviewed, 67 unresolved discussions (waiting on @lockshaw)
lib/execution/include/local_execution/local_allocator.h
line 16 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
Yeah, I don't think this is needed. I wrote the allocator first and just added that as an API in case but didn't end up using it.
lib/execution/include/local_execution/local_allocator.h
line 21 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why? Isn't this already handled by
cudaMalloc
?
Yeah, I don't think this is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 112 of 144 files reviewed, 67 unresolved discussions (waiting on @lockshaw)
lib/execution/include/local_execution/local_model_training_instance.h
line 34 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
See https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/
This is a good paradigm. But I'm not connecting the dots for this example. Wouldn't we have already initialized LocalModelTrainingInstance
before calling this function? That's what is expected for forward
, backward
, and update
.
Actually, in that case shouldn't we do LocalModelTrainingInstance::forward()
instead of forward(LocalModelTrainingInstance)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 112 of 145 files reviewed, 67 unresolved discussions (waiting on @lockshaw)
lib/CMakeLists.txt
line 3 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be better renamed "local-execution"--"execution" by itself is a bit vague
Done.
lib/execution/include/local_execution/local_allocator.h
line 11 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Worth including a parameter name here since it's not clear what this actually does.
Done.
lib/execution/include/local_execution/local_allocator.h
line 14 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The code to calculate the memory size needed for a tensor is shared between allocators, so it seems worth making this the interface rather than forcing each allocator to compute the size itself
Done.
lib/execution/include/local_execution/local_model_training_instance.h
line 14 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Maybe worth splitting out a
TrainingInstance
to remove duplication betweenLocalModelTrainingInstance
andModelTrainingInstance
(or whatever it is called now if you changed its name)
Done.
lib/execution/include/local_execution/local_task_argument_accessor.h
line 15 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Same for all methods here I think
I think to have this kind of polymorphism we need to have a variant that holds all argument types. 2 reasons:
- C++ doesn't allow virtual template functions
- Having a map like below doesn't make sense because we'd need to then have
LocalTaskArgumentAccessor<T>
for a particular argument type which doesn't work.
The variant solution (while it kind of seems like hard-coding) will fix this, and I think since arguments broadly fall into 2-3 categories, this is fine. (I thinkRuntimeArgRefSpec
orOpArgRefSpec
already does some of this)
Code snippet:
template <typename T>
std::unordered_map<slot_id, T> argument_map;
lib/execution/include/local_execution/local_training_backing.h
line 40 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just use the
ComputationGraph
you already have until we actually learn the performance is that important.
Done.
lib/execution/include/local_execution/local_training_backing.h
line 50 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably worth pulling out into a separate struct that's responsible for the internals of task registration to avoid
LocalTrainingBacking
from becoming monolithic and unwieldy
I've simplified these maps down, may pull more out.
lib/execution/src/task_spec/local_backing/local_allocator.cc
line 24 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why track allocated memory in
LocalAllocator
? Might be worth instead having a separateIAllocator
implementation that wraps anIAllocator
with this functionality instead
Sure, i've lined down LocalAllocator
accordingly to not need memory tracking and purely allocate/deallocate.
lib/execution/src/task_spec/local_backing/local_allocator.cc
line 37 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why was this chosen over not owning all of the memory regions (i.e., whoever uses them is responsible for deleting them)? No strong opinion, just curious
What would the alternative be? My take is that once LocalAllocator
goes out of scope, it should de-allocate all the memory it allocated, but perhaps deallocate
can be private. What do you think?
lib/execution/src/task_spec/local_backing/local_model_training_instance.cc
line 16 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why return this over having the
LocalModelTrainingInstance
just write out to an externally-provided tensor that's encoded when the instance is created?
Yeah, I'm changing this to what you said.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 55 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
move toinitialization list?
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 64 at r1 (raw file):
GenericTensorAccessorR LocalTrainingBacking::execute_forward() { for (auto operator_node: this->topologically_ordered_graph) {
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 74 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Check
containers.h
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 74 at r1 (raw file):
void LocalTrainingBacking::execute_backward() { for (auto operator_node: std::reverse(this->topologically_ordered_graph.begin(), this->topologically_ordered_graph.end())) {
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 78 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably worth pulling out into a separate method for readability
Done.
lib/execution/src/task_spec/local_backing/local_training_backing.cc
line 88 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Maybe make
LocalTaskArgumentAccessor
immutable rather than being incrementally constructed withinsert_tensor
? Not critical, just worth considering
Yeah, that's better. Done
lib/kernels/include/kernels/allocation.h
line 11 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Calculating the memory size of a tensor isn't an allocator-specific operation, so let's keep the
IAllocator
API like this. If you wanted this for convenience, I have no issue with putting it onAllocator
. Just keep in mind that everything thatIAllocator
defines must be implemented by each implementation ofIAllocator
Done.
lib/kernels/include/kernels/allocation.h
line 20 at r1 (raw file):
Allocator() = delete; void *allocate(Tensor);
Let's have Allocator
also be the one to return a GenericTensorAccessorW
since that logic is repeated pretty often.
Code snippet:
GenericTensorAccessorW allocate(TensorShape const & shape) {
size_t requested_mem_size = get_volume(shape); // dims * datatype
void* ptr = this->i_allocator.allocate(requested_mem_size);
return {shape.data_type, shape, ptr};
}
lib/op-attrs/include/op-attrs/get_task_ids.h
line 50 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just use an auto lambda
Done.
lib/op-attrs/include/op-attrs/get_task_ids.h
line at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should be moved to
local-execution
--task ids are not a concept that exists at the level ofop-attrs
Done.
lib/execution/include/task_spec/typed_future.h
line 32 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
In op_task_invocation.h
, we have OpArgSpec = variant<ConcreteArgSpec,...,CheckedTypedFuture,...>
with a bunch of arg types. Initially I thought this function lets us visit them uniformly, since all other types have an ArgTypeRuntimeTag
. But if ArgTypeRuntimeTag
is primarily for legion serialization, then we should just use type index and remove the future arg types from OpArgSpec
. We can always extend it in runtime.
lib/execution/include/task_spec/typed_future.h
line 58 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should replace
std::type_index
, not accompany it.
See above comment.
lib/execution/include/task_spec/typed_future_map.h
line 55 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This is going to have issues since
local-execution
should not depend on legion.
See above comment.
lib/execution/include/task_spec/typed_future_map.h
line 56 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should replace
std::type_index
.Also I'm not sure if this is necessary, since
ArgTypeRuntimeTag
currently (i.e., in the mainrepo-refactor
branch) primarily exists to enable legion serialization, which is not necessary for Futures (or for local execution at all)
See above comment.
lib/execution/include/task_spec/typed_task_invocation.h
line 120 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should replace
std::type_index
See above comment.
lib/execution/src/task_spec/op_task_invocation.h
line 66 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
auto lambda?
Done.
lib/execution/src/task_spec/op_task_signature.h
line 91 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Do we need multiple return values?
No, I think we agreed earlier there could be at most one.
…exFlow into task-spec-dsl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be a lot of stray commented-out code throughout this PR--it would be nice to try to clean up some of that before this is merged.
Also, try to use more specific and clear names--a lot of the names (e.g., tensor_mapping
) provide essentially no information, making the code way more challenging to read than it needs to be. Decomposing functions, even in instances where the functions are only called from that one location, can also be helpful for this reason--it bundles up pieces of functionality into nice named pieces that I can understand without having to wade through their implementation details
Also, let's try to aim for slightly smaller PRs in the future slightly_smiling_face:--I know a lot of the changes were relatively simple, but it's still a lot of code to go through
Reviewed 5 of 32 files at r2, 109 of 235 files at r4, 125 of 125 files at r5, all commit messages.
Reviewable status: all files reviewed, 126 unresolved discussions (waiting on @reyna-abhyankar)
lib/kernels/include/kernels/allocation.h
line 28 at r5 (raw file):
} void deallocate(GenericTensorAccessorW tensor_backing) {
Suggestion:
void deallocate(GenericTensorAccessorW const &tensor_backing)
lib/kernels/include/kernels/flat_kernels.h
line 4 at r5 (raw file):
#define _FLEXFLOW_OPS_KERNELS_FLAT_KERNELS_H #include "device.h"
Why this change?
lib/kernels/include/kernels/layer_norm_kernels.h
line 10 at r5 (raw file):
namespace FlexFlow { // class LayerNormPerDeviceState : public PerDeviceOpState {
Delete?
lib/kernels/include/kernels/reduce_kernels.h
line 37 at r5 (raw file):
ArrayShape output_shape); // void forward_kernel_wrapper(ReducePerDeviceState const &m,
Delete?
lib/kernels/include/kernels/reverse_kernels.h
line 8 at r5 (raw file):
namespace FlexFlow { using legion_coord_t = long long;
Just use the type directly--having a type named legion_coord_t
in a part of the codebase without legion is just confusing
lib/kernels/include/kernels/split_kernels.h
line 8 at r5 (raw file):
namespace FlexFlow { using legion_coord_t = long long;
Delete
lib/kernels/include/kernels/transpose_kernels.h
line 11 at r5 (raw file):
struct TransposePerDeviceState { int num_dim; req<int> perm[MAX_TENSOR_DIM];
vector
?
lib/kernels/include/kernels/transpose_kernels.h
line 36 at r5 (raw file):
// namespace Internal { // void forward_kernel_wrapper(TransposePerDeviceState const &m,
Delete?
lib/kernels/src/cuda/reverse_kernels.cu
line 20 at r5 (raw file):
namespace FlexFlow { // declare Legion names using legion_coord_t = long long;
Delete
lib/local-execution/include/arg_backing.h
line 19 at r5 (raw file):
using DeviceStates = std::variant<LinearPerDeviceState>; struct OpArgBacking {
visitable?
lib/local-execution/include/arg_backing.h
line 28 at r5 (raw file):
}; using ArgBackingMapping = std::unordered_map<operator_guid_t, OpArgBacking>;
What's the purpose of ArgBackingMapping
here?
lib/local-execution/include/local_allocator.h
line 19 at r5 (raw file):
private: std::unordered_set<void *> ptrs;
Why have a set of pointers?
lib/local-execution/include/local_model_training_instance.h
line 14 at r5 (raw file):
namespace FlexFlow { struct LocalModelTrainingInstance {
Turn into a visitable-style type without methods
lib/local-execution/include/local_task_argument_accessor.h
line 14 at r5 (raw file):
using SlotGradId = std::pair<slot_id, IsGrad>; using TensorBackingOption = std::variant<
Why a variant of maps instead of a map of variants?
lib/local-execution/include/local_task_argument_accessor.h
line 14 at r5 (raw file):
using SlotGradId = std::pair<slot_id, IsGrad>; using TensorBackingOption = std::variant<
Also I think it's generally better to wrap std::variant
s rather than just using them directly
lib/local-execution/include/local_task_argument_accessor.h
line 25 at r5 (raw file):
std::unordered_map<slot_id, ArgRefBacking> argument_map) : allocator(allocator), tensor_backing_map(tensor_backing_map), argument_map(argument_map){};
.cc
file
lib/local-execution/include/local_task_argument_accessor.h
line 30 at r5 (raw file):
ConcreteArgSpec const &get_concrete_arg(slot_id) const override; OpArgRefTypeBacking const &get_op_arg_ref(slot_id) const override;
These should already have been flattened down to all be ConcreteArgSpec
by the type that argument accessor is called--OpArgRefTypeBacking
, RuntimeArgRefTypeBacking
, etc. are just placeholders that should get filled in during taskspec compilation
lib/local-execution/include/local_task_argument_accessor.h
line 33 at r5 (raw file):
RuntimeArgRefTypeBacking const &get_runtime_arg(slot_id) const override; PrivilegeType
What is PrivilegeType
? (and PrivilegeVariadicType
?)
lib/local-execution/include/local_task_argument_accessor.h
line 42 at r5 (raw file):
size_t get_device_idx() const override { return 0;
.cc
file
lib/local-execution/include/local_training_backing.h
line 26 at r5 (raw file):
struct TaskRegistry { TaskRegistry(std::unordered_map<tensor_guid_t, GenericTensorAccessorW &>);
Why is this taking in a map of mutable references? That seems strange (and likely unsafe)
lib/local-execution/include/local_training_backing.h
line 27 at r5 (raw file):
struct TaskRegistry { TaskRegistry(std::unordered_map<tensor_guid_t, GenericTensorAccessorW &>); void register_task(task_id_t, operator_guid_t);
Why is TaskRegistry
tied to a PCG
and specific operator_guid_t
s?
lib/local-execution/include/local_training_backing.h
line 29 at r5 (raw file):
void register_task(task_id_t, operator_guid_t); // void register_args(operator_guid_t, OpArgBacking); bool is_tensor_allocated(tensor_guid_t tensor_id);
Feels like this should be const
(same for a lot of these)
lib/local-execution/include/local_training_backing.h
line 69 at r5 (raw file):
void execute_update(); DeviceSpecific<DeviceStates> call_init_task_impl(task_id_t,
Should this and the methods below be public?
lib/local-execution/include/op_task_invocation.h
line 28 at r5 (raw file):
std::variant<ConcreteArgSpec, OpArgRefSpec, RuntimeArgRefSpec>; struct OpArgSpecTypeAccessor {
Why is this a functor?
lib/local-execution/include/op_task_invocation.h
line 104 at r5 (raw file):
OpTaskBinding binding; }; FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(OpTaskInvocation,
Why nonstandard construction?
lib/local-execution/include/op_task_invocation.h
line 111 at r5 (raw file):
OpTaskBinding infer_bwd_binding(OpTaskBinding const &fwd); bool validate_invocation(OpTaskSignature sig, OpTaskInvocation inv) {
.cc
file
lib/local-execution/include/op_task_invocation.h
line 111 at r5 (raw file):
OpTaskBinding infer_bwd_binding(OpTaskBinding const &fwd); bool validate_invocation(OpTaskSignature sig, OpTaskInvocation inv) {
Suggestion:
bool is_invocation_valid(OpTaskSignature sig, OpTaskInvocation inv) {
lib/local-execution/include/op_task_invocation.h
line 112 at r5 (raw file):
bool validate_invocation(OpTaskSignature sig, OpTaskInvocation inv) { // tensors
Just pull into a separate function rather than a comment?
lib/local-execution/include/op_task_invocation.h
line 113 at r5 (raw file):
bool validate_invocation(OpTaskSignature sig, OpTaskInvocation inv) { // tensors auto tensor_bindings = inv.binding.get_tensor_bindings();
Avoid auto
unless the type name is really long
lib/local-execution/include/op_task_invocation.h
line 115 at r5 (raw file):
auto tensor_bindings = inv.binding.get_tensor_bindings(); for (OpTensorSlotSpec const &op_tensor_slot_spec : sig.get_tensor_slots()) { slot_id name = op_tensor_slot_spec.name;
Seems like this is just validating an OpTensorSlotSpec
against an OpTensorSpec
--why not create a separate function for doing that? This current function that lumps everything together is a pain to read.
lib/local-execution/include/op_task_invocation.h
line 117 at r5 (raw file):
slot_id name = op_tensor_slot_spec.name; IsGrad is_grad = op_tensor_slot_spec.is_grad; std::pair<slot_id, IsGrad> tensor_key = std::make_pair(name, is_grad);
Suggestion:
td::pair<slot_id, IsGrad> tensor_key = std::make_pair(
op_tensor_slot_spec.name,
op_tensor_slot_spec.is_grad
);
lib/local-execution/include/op_task_invocation.h
line 128 at r5 (raw file):
auto sig_arg_types = sig.get_arg_types(); OpArgSpecTypeAccessor type_accessor; for (auto arg_binding : inv.binding.get_arg_bindings()) {
Pull out into a separate funciton
lib/local-execution/include/op_task_invocation.h
line 132 at r5 (raw file):
OpArgSpec op_arg_spec = arg_binding.second; std::type_index arg_type = sig_arg_types.at(name); if (type_accessor(op_arg_spec) != arg_type) {
Not really a helpful or clear name
Code quote:
type_accessor
lib/local-execution/src/local_model_training_instance.cc
line 13 at r5 (raw file):
EnableProfiling enable_profiling, ProfilingSettings profiling_settings) { this->local_training_backing = LocalTrainingBacking(computation_graph,
Rather than delegating everything why not just have LocalModelTrainingInstance
take in a LocalTrainingBacking
or the other way around?
lib/local-execution/src/local_task_argument_accessor.cc
line 41 at r5 (raw file):
std::pair<slot_id, IsGrad> slot_grad_pair = std::make_pair(slot, is_grad); auto variadic_tensor_backing_map = std::get< std::unordered_map<SlotGradId, std::vector<GenericTensorAccessorW>>>(
Wrap your variant
s and tuple
s to avoid code like this
lib/local-execution/src/local_training_backing.cc
line 21 at r5 (raw file):
void TaskRegistry::register_task(task_id_t task_id, operator_guid_t op_id) { TaskSignatureImpl task_signature_impl = {get_task_impl<task_id>(),
Suggestion:
TaskImplWithSignature
lib/local-execution/src/local_training_backing.cc
line 40 at r5 (raw file):
bool TaskRegistry::is_tensor_allocated(tensor_guid_t tensor_id) { return this->tensor_mapping.find(tensor_id) != this->tensor_mapping.end();
contains
in containers.h
And const
lib/local-execution/src/local_training_backing.cc
line 45 at r5 (raw file):
GenericTensorAccessorW & TaskRegistry::get_tensor_backing(tensor_guid_t tensor_id) { return this->tensor_mapping.at(tensor_id);
Rather unclear name--from this name it could be a mapping from each tensor to its favorite color for all I know 🙂
Code quote:
tensor_mapping
lib/local-execution/src/local_training_backing.cc
line 64 at r5 (raw file):
this->task_registry = TaskRegistry(allocated_tensors); std::vector<operator_guid_t> layers = computation_graph.traverse(); for (operator_guid_t const &node : layers) {
traverse
is also really unclear, do you mean get_topological_ordering
?
Suggestion:
for (operator_guid_t const &node : computation_graph.traverse()) {
lib/local-execution/src/local_training_backing.cc
line 69 at r5 (raw file):
for (task_id_t task_id : task_ids) { this->task_registry.register_task(task_id, node); }
All of the intermediate variables make the code rather hard to read--they don't need to be eliminated entirely, but it would really help readability if the number was reduced a bit
Suggestion:
CompGraphOperatorAttrs attrs = computation_graph.get_layer_attrs(node);
for (task_id_t task_id : get_task_ids(attrs)) {
this->task_registry.register_task(task_id, node);
}
lib/local-execution/src/local_training_backing.cc
line 98 at r5 (raw file):
this->task_registry.get_init_signature(operator_node), invocation)); TaskArgumentAccessor accessor = this->get_task_arg_accessor(invocation); DeviceSpecific<DeviceStates> device_state =
Why are we just throwing away the per-device state here?
lib/local-execution/src/local_training_backing.cc
line 111 at r5 (raw file):
this->task_registry.task_mapping.at(task_id); auto fn = std::get<std::function<DeviceStates(TaskArgumentAccessor const &)>>( task_sig_impl.impl_function);
Suggestion:
auto task_impl_function = std::get<std::function<DeviceStates(TaskArgumentAccessor const &)>>(
task_sig_impl.impl_function);
lib/local-execution/src/local_training_backing.cc
line 147 at r5 (raw file):
TaskArgumentAccessor accessor = this->get_task_arg_accessor(invocation, operator_guid_t); this->call_task_impl(invocation.task_id, accessor);
This seems duplicated a couple times--why not just have a call_task_impl
overload that takes a signature and an invocation and does this part?
Code quote:
assert(validate_invocation(
this->task_registry.get_bwd_signature(operator_node), invocation));
TaskArgumentAccessor accessor =
this->get_task_arg_accessor(invocation, operator_guid_t);
this->call_task_impl(invocation.task_id, accessor);
lib/local-execution/src/local_training_backing.cc
line 152 at r5 (raw file):
void LocalTrainingBacking::execute_update() { not_implemented();
Suggestion:
NOT_IMPLEMENTED();
lib/local-execution/src/local_training_backing.cc
line 158 at r5 (raw file):
TaskArgumentAccessor LocalTrainingBacking::get_task_arg_accessor(OpTaskInvocation invocation,
Isn't this just a pure function of the OpTaskInvocation
and the OpTaskSignature
? Why is the task registry getting all involved here?
lib/local-execution/src/ops/split.cc
line 96 at r5 (raw file):
auto attrs = acc.get_argument<SplitAttrs>(ATTRS); legion_coord_t num_blks, in_blk_size, out_blk_size[MAX_NUM_OUTPUTS];
"block" is much clearer than "blk"
Suggestion:
legion_coord_t num_blocks, input_block_size, output_block_size[MAX_NUM_OUTPUTS];
lib/local-execution/src/ops/split.cc
line 126 at r5 (raw file):
SimTaskBinding fwd_binding; fwd_binding.bind(INPUT, input.shape);
At some point this should just be replaced by the existing forward
function
lib/op-attrs/include/op-attrs/ops/embedding.h
line 13 at r5 (raw file):
namespace FlexFlow { enum class AggregateOp { SUM, AVG, NONE };
Just use a std::optional<AggregateOp>
if you want this functionality
Suggestion:
enum class AggregateOp { SUM, AVG };
lib/pcg/include/pcg/computation_graph.h
line 19 at r5 (raw file):
OutputLabelledMultiDiGraph<Layer, Tensor>> { using strong_typedef::strong_typedef;
Make most/all of these methods functions
lib/pcg/include/pcg/computation_graph.h
line 20 at r5 (raw file):
using strong_typedef::strong_typedef; std::vector<operator_guid_t> traverse() {
Move to .cc
file? (same with the other methods here)
lib/pcg/include/pcg/computation_graph.h
line 27 at r5 (raw file):
} std::vector<operator_guid_t> traverse_reverse_order() {
Pretty much all of these should be const
lib/pcg/include/pcg/computation_graph.h
line 35 at r5 (raw file):
} bool out_edge_comparator(MultiDiOutput x, MultiDiOutput y) {
Since this appears to only be used in sort_edge_set
, just use a lambda there rather than defining a whole new method?
lib/pcg/include/pcg/computation_graph.h
line 50 at r5 (raw file):
[&](MultiDiOutput const &e) -> tensor_guid_t { return tensor_guid_t{e}; });
Also, what is this function doing? The name is not at all clear
Suggestion:
std::unordered_set<MultiDiOutput> outputs =
transform(
sorted_by(
edges,
compare_by([](MultiDiEdge const &e) { return e.src_idx; })
),
[](MultiDiEdge const &o) {
return tensor_guid_t{o};
}
);
lib/pcg/include/pcg/computation_graph.h
line 53 at r5 (raw file):
} std::vector<tensor_guid_t> get_outgoing_tensors(operator_guid_t n) {
Why sort? I'm not sure I understand why something as get_outgoing_tensors
should be doing things like sorting
lib/pcg/include/pcg/computation_graph.h
line 61 at r5 (raw file):
} operator_guid_t add_node(Layer const &layer) {
Suggestion:
add_layer
lib/pcg/include/pcg/computation_graph.h
line 75 at r5 (raw file):
} tensor_guid_t create_outgoing_edge_with_label(operator_guid_t node,
Aren't all edges labelled?
lib/pcg/include/pcg/computation_graph.h
line 89 at r5 (raw file):
MultiDiOutput input_view = input.value(); MultiDiEdge edge = {node.value(), NodePort{incoming_edge_dst_port++},
Fix usage of node ports (todo for me)
lib/runtime/include/runtime/model_training_instance.h
line 16 at r5 (raw file):
struct ModelTrainingInstance { TrainingConfig training_config;
Why this change?
lib/utils/include/utils/graph/node_port.h
line 17 at r5 (raw file):
using strong_typedef::strong_typedef; static NodePort construct_port(size_t idx);
Remove
lib/execution/include/local_execution/local_allocator.h
line 11 at r1 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done.
Not seeing it
lib/execution/include/local_execution/local_allocator.h
line 16 at r1 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Yeah, I don't think this is needed. I wrote the allocator first and just added that as an API in case but didn't end up using it.
Then remove it?
lib/execution/include/local_execution/local_allocator.h
line 21 at r1 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Yeah, I don't think this is needed.
Then remove?
lib/execution/src/ops/attention.cc
line 174 at r1 (raw file):
} static void forward_task(Task const *task,
Where did forward_task
go?
lib/execution/src/task_spec/local_backing/local_allocator.cc
line 37 at r1 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
What would the alternative be? My take is that once
LocalAllocator
goes out of scope, it should de-allocate all the memory it allocated, but perhapsdeallocate
can be private. What do you think?
The alternative would be that once LocalAllocator
goes out of scope nothing happens to those memory regions--it's that calling code's responsibility to clean it up (not arguing for this, just wanted to understand why you went with this way).
lib/execution/src/task_spec/local_backing/local_allocator.cc
line 38 at r1 (raw file):
LocalAllocator::~LocalAllocator() { for (auto it = this->ptr_memory_size_mapping.begin(); it != this->ptr_memory_size_mapping.end();) {
Why not just a for-in
(i.e., for ( .... : ... )
) loop?
lib/local-execution/include/arg_ref.h
line 58 at r5 (raw file):
: type_idx(type_index), ref_type(ref_type) {} std::type_index type_idx;
I assume this change was made to avoid depending on Legion::Serializer
?
lib/local-execution/include/config.h
line 49 at r5 (raw file):
}; using legion_mapping_tag_id_t = unsigned long;
Probably better to just create a wrapper type?
lib/local-execution/include/device_specific.h
line 13 at r5 (raw file):
DeviceSpecific() = delete; DeviceSpecific(T ptr_type) { // accessor
Why does this exist? Isn't DeviceSpecific::create
sufficient?
lib/local-execution/include/op_task_signature.h
line 44 at r5 (raw file):
struct OpTaskSignature { OpTaskSignature() = default;
Why? Generally avoid default construction unless there's a super obvious initial state, and I don't see what the super obvious initial state is for OpTaskType
lib/local-execution/include/op_task_signature.h
line 95 at r5 (raw file):
std::unordered_set<OpTensorSlotSpec> op_tensor_slots; }; // FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(OpTaskSignature,
Why?
lib/local-execution/include/profiling.h
line 15 at r5 (raw file):
std::optional<float> elapsed = profiling_wrapper<F, Ts...>(f, profiling, std::forward<Ts>(ts)...); // TODO -- local logger?
What does this comment mean? Why was this commented out?
lib/local-execution/include/tasks.h
line 173 at r5 (raw file):
std::string const &name, F const &func, std::optional<F const &> cpu_func = std::nullopt);
I don't think we get optional
s of references with std::optional
--I think this was a tl::optional
-only feature
lib/local-execution/include/tracked_allocator.h
line 10 at r5 (raw file):
struct TrackedAllocator : public IAllocator { TrackedAllocator(size_t);
TrackedAllocator
should just wrap an Allocator
rather than reimplementing all of LocalAllocator
lib/local-execution/include/tracked_allocator.h
line 20 at r5 (raw file):
private: const size_t total_memory_size;
(admittedly this should happen automatically when you format)
Suggestion:
size_t const total_memory_size;
lib/local-execution/src/tracked_allocator.cc
line 42 at r5 (raw file):
} Allocator get_tracked_memory_allocator(size_t total_memory_size) {
Suggestion:
Allocator create_tracked_memory_allocator(size_t total_memory_size) {
lib/local-execution/src/ops/attention.cc
line 124 at r5 (raw file):
int num_heads = get_piece_shape(weight_parallel_tensor_shape)[ff_dim_t(1)]; // MHAPerDeviceState per_device_state =
Delete?
lib/local-execution/src/ops/attention.cc
line 348 at r5 (raw file):
"Attention Bwd", bwd_signature<ATTENTION_BWD_TASK_ID>(), backward_task_impl);
How does this work with legion tasks? Doesn't backward_task_impl
have the wrong argument types?
lib/local-execution/src/ops/batch_matmul.cc
line 190 at r5 (raw file):
template <> OpTaskSignature fwd_signature<BATCHMATMUL_FWD_TASK_ID>() { OpTaskSignature fwd;
Why is OpTaskSignature
default constructible? Does it make any sense to have an OpTaskSignature
without a type
?
lib/local-execution/src/ops/linear.cc
line 149 at r5 (raw file):
"[Linear] backward_time = %.2lfms\n", per_device_state, (void *)input.get_float_ptr(),
Why is everything getting cast to a void*
here?
lib/local-execution/src/ops/replicate.cc
line 75 at r5 (raw file):
input_grad, output_grad, attrs.replicate_degree); // is this `num_replicas`?
I believe so
lib/local-execution/src/ops/reverse.cc
line 24 at r5 (raw file):
using namespace FlexFlow::Kernels::Reverse; using legion_coord_t = long long;
Why is legion being referenced here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 95 unresolved discussions (waiting on @reyna-abhyankar)
lib/runtime/src/task_invocation_compilation.cc
line at r1 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
So there were actually 2 files with the same name in different subfolders. This deleted one looks like a precursor to the newer file which is in
lib/execution/src/task_spec/task_invocation_compilation.cc
and has a bunch of Legion stuff. Let me know if that's correct (I haven't verified which one is actually older via commit history, so I'll do that if you can't immediately tell).
I think they both need to exist--this is the one for the actual compilation to legion objects, which shouldn't be part of local-execution
but is critical for runtime
lib/execution/include/local_execution/local_model_training_instance.h
line 34 at r1 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This is a good paradigm. But I'm not connecting the dots for this example. Wouldn't we have already initialized
LocalModelTrainingInstance
before calling this function? That's what is expected forforward
,backward
, andupdate
.Actually, in that case shouldn't we do
LocalModelTrainingInstance::forward()
instead offorward(LocalModelTrainingInstance)
?
I guess the question to answer is: if the lifetime of LocalTrainingBacking
is different than the lifetime of LocalModelTrainingInstance
(i.e., we create a LocalModelTrainingInstance
and then only later do we actually initialize the backing), then should they be part of the same object? We generally try to avoid objects/structs having a whole bunch of state changes as that can't be tracked by the type system, and instead represent those state changes with different types so it can be checked by the type system (e.g., rather than having a FFModel
which has to have a computation graph constructed, then legion initialized, then run, etc. we have a ComputationGraph
object and a LegionBacking
object that get created as you go along, so that if you have a LegionBacking
you know for sure that legion has already been initialized)
lib/execution/include/local_execution/local_task_argument_accessor.h
line 37 at r1 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Why not just
using SlotGradId = std::pair<slot_id, IsGrad>
?
Because that's just an alias--anywhere I am using SlotGradId
I can instead use a std::pair
directly (as is being done here) which means the types only make sense if I know what type a SlotGradId
actually is (so the alias doesn't really have much of a point), and it leads to code that looks like std::get<IsGrad>(my_slot_id)
rather than my_slot_id.is_grad
which is often easier to read. It's less of an issue here because pair
s are pretty small and you benefit from the fact that both of the elements have wrapper types that make their purpose quite clear, but this would definitely improve some other parts of the code where you don't have this advantage (I care about this occurence here a bit less for that reason)
It's a similar answer to "why provide Node
when you could just return int
?"--while int
is sufficiently expressive, this expressiveness actually makes the code harder to read (it's why we don't just pass around everything as a void *
🎚️)
…nto local-execution
Description of changes:
Implement local execution
Related Issues:
Linked Issues:
Issues closed by this PR:
infer_bwd_binding
#1043OpTaskInvocation
s againstOpTaskSignature
s #924Initial TODO
infer_bwd_binding
, task reg #1097Later TODO
OpTaskSignatureBuilder
etc.This change is